Genetic Sequences – Tracing the Mutations of a Disease

"DMWS-MTA SZTAKI"
Data Mining and Web search Research Group
Computer and Automation Research Institute, Hungarian Academy of Sciences

VAST 2010 Challenge
Genetic Sequences – Tracing the Mutations of a Disease

Authors and Affiliations:

Eszter Friedman, MTA SZTAKI, feszter@info.ilab.sztaki.hu
Julianna Göbölös-Szabó, MTA SZTAKI, gobolos.szabo.julianna@gmail.com
Adrienn Szabó, MTA SZTAKI, adrienn.szabo4@gmail.com, [PRIMARY contact]
András Lukács, MTA SZTAKI, alukacs@sztaki.hu

Tool(s):

Our in-house tool builds a tree from the input DNA sequences using the neighbor-joining algorithm (http://en.wikipedia.org/wiki/Neighbor-joining) and displays it to show the likely evolutionary paths. The visualization of the computed phylogenetic tree reflects the lines of descent. Nodes are colored to represent properties of the viruses/sequences. Building the tool took about 20 days of work.

The tool was developed by Adrienn Szabó using the processing.org environment.

Video:

Video

ANSWERS:

MC3.1: What is the region or country of origin for the current outbreak? Please provide your answer as the name of the native viral strain along with a brief explanation.

The evolution of the current Drafa virus can be visualized by a phylogenetic tree (dendrogram), which shows the most probable parent-child relationships among the different virus strains. The screenshot shows that the closest ancestor of all the current outbreak sequences is the virus strain “Nigeria_B”.

MC3.2: Over time, the virus spreads and the diversity of the virus increases as it mutates. Two patients infected with the Drafa virus are in the same hospital as Nicolai. Nicolai has a strain identified by sequence 583. One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51. Assume only a single viral strain is in each patient. Which patient likely contracted the illness from Nicolai and why? Please provide your answer as the sequence number along with a brief explanation.

The patient with sequence 123 is much more likely to have been infected by Nicolai Kuryakin (NK), as that strain is closer to NK's strain in the tree (in fact, sequence 123 is a direct child of sequence 583). If we check the difference between these two sequences (by clicking on the child node), we can see that they differ in only one position. Sequence 51, on the other hand, is one mutation away from sequence 531 (a common ancestor), and sequence 583 is two mutations away from the same node. All three mutations occurred at different positions, so strains 51 and 583 differ in three positions.

MC3.3: Signs and symptoms of the Drafa virus are varied and humans react differently to infection. Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them.

Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic). The mutations involve one or more base substitutions. For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39)

Our tool colors the nodes of the tree according to the provided data. Green means a less dangerous virus strain (low mortality, mild symptoms, etc.), red means severe problems (high mortality, severe symptoms, etc.), and yellow means a level of danger in between.

So color changes between a parent and its child show the mutation's effect on the aggressiveness of the virus. The worst changes can be highlighted by adjusting a scrollbar (named “Dangerous mutation threshold”).

When a node of the phylogenetic tree is clicked, the virus's characteristics are shown in a table, and the mutations between the selected node and its parent are also listed. We found that the worst mutations according to symptom severity are (positions are counted from 0):

A → G, 222

A → T, 945

A → C, 268
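
The parent-to-child comparison that the tool performs when a node is clicked can be sketched as follows. This is a minimal Python illustration, not the authors' Processing code: the function names `list_mutations` and `format_mutations` are hypothetical, and the sequences are assumed to be aligned strings of equal length, with positions counted from 0 as in the answer above.

```python
def list_mutations(parent, child):
    """Return (old_base, new_base, position) for every position that differs.

    Assumes both sequences are aligned and of equal length; positions are 0-based.
    """
    if len(parent) != len(child):
        raise ValueError("sequences must have the same length")
    return [(p, c, pos)
            for pos, (p, c) in enumerate(zip(parent, child))
            if p != c]


def format_mutations(mutations):
    """Format mutations in the challenge's notation, e.g. 'G → A, 513 and T → A, 907'."""
    return " and ".join(f"{p} → {c}, {pos}" for p, c, pos in mutations)


# Toy sequences (made up for illustration, not challenge data).
print(format_mutations(list_mutations("ACGTA", "ACCTT")))
# prints: G → C, 2 and A → T, 4
```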

MC3.4: Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question. To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.

Consider each virulence and drug resistance characteristic as equally important. Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions. In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups. For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39).

We developed a “Phylogenetic Tree Visualizer” to show connections and changes between the members of a family of viral strains. The development process took only about a month, thanks to the Processing.org environment. We implemented the neighbor-joining algorithm, which builds an evolutionary tree efficiently from the pairwise distances of the input sequences. The distance between two sequences was simply the number of positions at which their bases differ.
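
To make the two steps concrete, here is a small Python sketch of a mismatch-count (Hamming) distance matrix followed by textbook neighbor-joining. It is an illustration under stated assumptions, not the tool's actual code: the names `hamming`, `distance_matrix`, `neighbor_joining` and the `inner0`, `inner1`, ... labels are hypothetical, the sequences are assumed to be aligned and of equal length, and the result is an unrooted list of edges (how the tool roots the tree for display is not covered here).

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs):
    """Pairwise distances as a nested dict: dist[name1][name2]."""
    return {p: {q: hamming(seqs[p], seqs[q]) for q in seqs} for p in seqs}


def neighbor_joining(dist):
    """Return a list of (node, neighbor, branch_length) edges of the NJ tree."""
    d = {p: dict(row) for p, row in dist.items()}  # work on a copy
    nodes = list(d)
    if len(nodes) < 2:
        raise ValueError("need at least two sequences")
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums over the currently active nodes.
        r = {p: sum(d[p][q] for q in nodes) for p in nodes}
        # Pick the pair minimizing Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j).
        i, j = min(combinations(nodes, 2),
                   key=lambda pq: (n - 2) * d[pq[0]][pq[1]] - r[pq[0]] - r[pq[1]])
        u = f"inner{next_id}"
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        len_i = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        edges += [(u, i, len_i), (u, j, len_j)]
        # Distances from the new node to every remaining node.
        d[u] = {u: 0.0}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))  # connect the last two nodes
    return edges


# Toy input (made up for illustration, not challenge data).
toy = {"s1": "ACGTAC", "s2": "ACGTAA", "s3": "TCGTAA", "s4": "TCGAAA"}
for edge in neighbor_joining(distance_matrix(toy)):
    print(edge)
```

Ties between equally good pairs are broken by input order in this sketch, so different orderings of the same sequences may yield different but equally valid trees.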

The tool helps find specific nodes through a “Search” box into which the name of the sequence is typed. If the tree contains a node with the given name, it is highlighted with a blinking red circle.

Essentially the same method was used to answer this question as in the previous section, with one extra pre-processing step: a function in the program averaged all the disease characteristics into a new, aggregated virus characteristic named “Dangerousness”. The averaging worked as follows: we assigned weight 1.0 to all the dangerous characteristics (like “high mortality” and “major complications”), weight 0.5 to the medium ones (like “moderate symptoms”), and weight 0.05 to the non-dangerous characteristics (“low mortality”, “minor drug-resistance”, etc.). This means that the worst characteristics are slightly overweighted, so our tool highlights medium-to-worst changes somewhat more strongly than mild-to-medium ones.
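
The averaging can be sketched in a few lines of Python. The three weights come from the description above; everything else is an assumption made for illustration, in particular the level names `"high"`, `"medium"`, `"low"`, the characteristic names in the example, and the function name `dangerousness`.

```python
# Weights as described in the text; the level names are assumed labels.
LEVEL_WEIGHT = {"high": 1.0, "medium": 0.5, "low": 0.05}


def dangerousness(characteristics):
    """Average the weights of all characteristics of one strain (result in 0..1)."""
    if not characteristics:
        raise ValueError("at least one characteristic is required")
    total = sum(LEVEL_WEIGHT[level] for level in characteristics.values())
    return total / len(characteristics)


# Hypothetical strain, not challenge data.
strain = {
    "mortality": "high",
    "symptoms": "medium",
    "drug_resistance": "low",
    "complications": "high",
}
print(dangerousness(strain))  # (1.0 + 0.5 + 0.05 + 1.0) / 4
```

With these weights, a step from medium to high raises one characteristic's contribution by 0.5, while a step from low to medium raises it by 0.45, which is the slight overweighting mentioned above.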

The mapping from dangerousness or any other characteristic value (between 0 and 1) to a color (green to red) was quite simple, because we could use the HSV color mode of Processing.org (called HSB there) and only needed to change the hue.
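
A hue-only mapping of this kind might look like the following Python sketch, which uses the standard `colorsys` module instead of Processing. The linear ramp from green (hue 120°) at value 0 to red (hue 0°) at value 1, the full saturation and brightness, and the function name `value_to_rgb` are assumptions; the text only states that the hue alone was varied.

```python
import colorsys


def value_to_rgb(value):
    """Map a value in [0, 1] to an 8-bit RGB color: 0 is green, 0.5 yellow, 1 red."""
    value = min(1.0, max(0.0, value))   # clamp out-of-range input
    hue = (1.0 - value) / 3.0           # 1/3 of the hue circle is green, 0 is red
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return tuple(round(c * 255) for c in (r, g, b))


for v in (0.0, 0.5, 1.0):
    print(v, value_to_rgb(v))
```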

The steps to find the most dangerous mutations are as follows. The user chooses “Dangerousness” as the coloring mode, and the tree is displayed with the averaged node colors described above. The “Dangerous mutation threshold” slider can then be used to highlight the most dangerous changes: a node is highlighted when the difference in dangerousness between it and its parent exceeds the threshold bound to the slider. It is easy to see that the highlighted nodes (marked with a black ring) are indeed the ones with a greenish (not really dangerous) parent and a yellowish (somewhat dangerous) or reddish (dangerous) child. Clicking on a node shows its original properties (characteristics) in the table in the lower left corner, so one can check whether the disease characteristics are really dangerous.
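
The threshold test behind the slider can be sketched as below. This is a hedged Python illustration rather than the tool's code: the tree is assumed to be stored as a child-to-parent dictionary, `score` is assumed to hold each node's dangerousness in [0, 1], and the name `highlighted_nodes` and the node labels are hypothetical.

```python
def highlighted_nodes(parent_of, score, threshold):
    """Return child nodes whose dangerousness exceeds their parent's by more
    than `threshold`, with the largest increase first."""
    def increase(child):
        return score[child] - score[parent_of[child]]

    hits = [child for child in parent_of if increase(child) > threshold]
    return sorted(hits, key=increase, reverse=True)


# Hypothetical tree and scores, not challenge data.
parent_of = {"n1": "root", "n2": "root", "n3": "n1"}
score = {"root": 0.10, "n1": 0.15, "n2": 0.60, "n3": 0.90}

print(highlighted_nodes(parent_of, score, threshold=0.4))
# prints: ['n3', 'n2']
```

Lowering the threshold corresponds to moving the slider so that smaller parent-to-child increases are also ringed.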

We found that the worst mutation is G → C, 847 at node 501, followed by three equally dangerous mutations: A → T, 679; C → T, 643; and T → G, 664, at nodes 525, 299 and 93 respectively, as shown in the screenshot.
